Taking a Quick Look at PDB

The Protein Data Bank (PDB) is the main repository of biomolecular structure data.

Here we read the current composition statistics, exported as a CSV file from the web page: https://www.rcsb.org/stats/summary

tbl <- read.csv("Data Export Summary.csv", row.names = 1)
tbl
##                          X.ray   NMR   EM Multiple.methods Neutron Other  Total
## Protein (only)          144433 11881 6732              182      70    32 163330
## Protein/Oligosaccharide   8543    31 1125                5       0     0   9704
## Protein/NA                7621   274 2165                3       0     0  10063
## Nucleic acid (only)       2396  1399   61                8       2     1   3867
## Other                      150    31    3                0       0     0    184
## Oligosaccharide (only)      11     6    0                1       0     4     22
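
A quick sanity check on the imported table is to confirm that each row's Total equals the sum of its method columns. The sketch below rebuilds the table from inline text so that it runs without the downloaded file. The raw header names used here ("Molecular Type", "X-ray", "Multiple methods") are assumptions about the CSV, chosen to show how read.csv() converts them into the X.ray and Multiple.methods names printed above.

```r
# Inline copy of the table; the header spelling is an assumption about the raw CSV
csv_text <- "Molecular Type,X-ray,NMR,EM,Multiple methods,Neutron,Other,Total
Protein (only),144433,11881,6732,182,70,32,163330
Protein/Oligosaccharide,8543,31,1125,5,0,0,9704
Protein/NA,7621,274,2165,3,0,0,10063
Nucleic acid (only),2396,1399,61,8,2,1,3867
Other,150,31,3,0,0,0,184
Oligosaccharide (only),11,6,0,1,0,4,22"

# row.names = 1 uses the first column as row names;
# the default check.names = TRUE rewrites "X-ray" as "X.ray"
tbl <- read.csv(text = csv_text, row.names = 1)
names(tbl)

# Every column except Total is an experimental method
method_cols <- setdiff(names(tbl), "Total")

# TRUE when each row's Total matches the sum of its method counts
all(rowSums(tbl[, method_cols]) == tbl$Total)
```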

Question 1: What percentage of structures in the PDB are solved by X-ray and electron microscopy (EM)?

#Check the sums of all the columns in the data set
#colSums(tbl)

#Sum every column, then divide each column sum by the sum of the "Total" column and multiply by 100 to get a percentage
n.type <- colSums(tbl)
n.type / n.type["Total"] * 100
##            X.ray              NMR               EM Multiple.methods 
##      87.16888390       7.27787573       5.38868408       0.10632046 
##          Neutron            Other            Total 
##       0.03846770       0.01976813     100.00000000
#To use the above method to generate the answer to the question, we would store n.type / n.type["Total"] * 100 in a variable
#and then type 'r variable[1]' to output the X-ray percentage and 'r variable[3]' to output the EM percentage,
#since the result is a vector with several values, each at a fixed position
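
The positional indexing described above can also be done by name, which avoids having to count columns. A minimal sketch, assuming the table is rebuilt by hand from the printed output; the name pct is illustrative and not part of the original analysis.

```r
# Rebuild the summary table from the values printed earlier
tbl <- data.frame(
  X.ray            = c(144433, 8543, 7621, 2396, 150, 11),
  NMR              = c(11881, 31, 274, 1399, 31, 6),
  EM               = c(6732, 1125, 2165, 61, 3, 0),
  Multiple.methods = c(182, 5, 3, 8, 0, 1),
  Neutron          = c(70, 0, 0, 2, 0, 0),
  Other            = c(32, 0, 0, 1, 0, 4),
  Total            = c(163330, 9704, 10063, 3867, 184, 22),
  row.names = c("Protein (only)", "Protein/Oligosaccharide", "Protein/NA",
                "Nucleic acid (only)", "Other", "Oligosaccharide (only)")
)

# Percentage of all entries contributed by each method
n.type <- colSums(tbl)
pct <- n.type / n.type["Total"] * 100

# Select the two methods of interest by name instead of by position
pct[c("X.ray", "EM")]
```

In the prose of an R Markdown file, the same values could then be written inline as round(pct["X.ray"], 3) and round(pct["EM"], 3).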

#The less elegant way I came up with
XR <- sum(tbl[,1]) / sum(tbl[,7]) * 100 
EM <- sum(tbl[,3]) / sum(tbl[,7]) * 100 
XR
## [1] 87.16888
EM
## [1] 5.388684
#How do we get an output with only 3 decimal places?
XRr <- round(XR, digits = 3)
XRr
## [1] 87.169
EMr <- round(EM, digits = 3)
EMr
## [1] 5.389
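
round() is vectorised, so both percentages can be rounded in one call. It changes the stored number rather than the printed width, so trailing zeros are dropped when the value is printed; sprintf() instead returns text with a fixed number of decimals. A small sketch, using only the three columns needed here (values copied from the table above) and the illustrative name pct:

```r
# Column values copied from the printed table (X-ray, EM, and Total only)
x.ray <- c(144433, 8543, 7621, 2396, 150, 11)
em    <- c(6732, 1125, 2165, 61, 3, 0)
total <- c(163330, 9704, 10063, 3867, 184, 22)

# Both percentages in one named vector
pct <- c(XR = sum(x.ray), EM = sum(em)) / sum(total) * 100

# round() accepts the whole vector at once
round(pct, digits = 3)

# sprintf() returns character strings with exactly three decimals
sprintf("%.3f%%", pct)

# The difference shows up when the rounded value ends in zero
round(5.3900, digits = 3)   # prints 5.39
sprintf("%.3f", 5.3900)     # "5.390"
```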

X-ray structures make up 87.169% of all structures in the PDB.

EM structures make up 5.389% of all structures in the PDB.

Question 2: What proportion of structures in the PDB are protein?

#Take the total number of protein entries (located in row 1, column 7) and divide it by the sum of the total column
Prot <- round(tbl[1,7] / sum(tbl[,7]) * 100, digits = 3)
Prot
## [1] 87.263
#Barry's more elegant solution
#tbl$Total[1]
#This way you only need to know the column name, not its number, and you can still pick the row you want by position
#It also protects you if the layout of the export changes, because 'Total' is looked up by name regardless of where the column sits
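
The same name-based idea extends to rows: both the row and the column can be looked up by name, so the calculation does not depend on where either one sits in the table. A minimal sketch on a cut-down copy of the table (only the X.ray and Total columns, values copied from above); the names prot, tbl2, and prot2 are illustrative.

```r
# Cut-down copy of the table: two columns are enough for this check
tbl <- data.frame(
  X.ray = c(144433, 8543, 7621, 2396, 150, 11),
  Total = c(163330, 9704, 10063, 3867, 184, 22),
  row.names = c("Protein (only)", "Protein/Oligosaccharide", "Protein/NA",
                "Nucleic acid (only)", "Other", "Oligosaccharide (only)")
)

# Row and column are both selected by name
prot <- tbl["Protein (only)", "Total"] / sum(tbl$Total) * 100
round(prot, digits = 3)

# Reordering the columns does not change the answer
tbl2 <- tbl[, c("Total", "X.ray")]
prot2 <- tbl2["Protein (only)", "Total"] / sum(tbl2$Total) * 100
identical(prot, prot2)
```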

Protein-only entries make up 87.263% of all structures in the PDB.

Question 3: Type HIV in the search box on the PDB website home page. How many HIV-1 protease structures are in the current PDB?

Inserting an Image File